fix(flows): make paper_flow produce a valid Paper under strict structured output #79
Open
savycompany wants to merge 1 commit into
paper_flow set the agent's output_type to the canonical knowledge.Paper,
whose store schema is hostile to OpenAI structured output:
- Paper.nodes is a dict[UUID, TreeNode]; a dict serialises to
additionalProperties, which the Agents SDK strict-JSON-schema mode
rejects ("additionalProperties should not be set for object types").
- With strict mode off, models do not emit RFC-4122 UUIDs, so the UUID
id fields fail Pydantic validation on free-form ids like "intro_node".
Either way the default flow raised before returning, so no real
extraction completed end-to-end.
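
Both failure modes can be reproduced with a minimal, stdlib-only sketch. The schema literal below is a hand-written approximation of what a dict-valued field looks like in JSON schema, and find_open_objects is a hypothetical, simplified stand-in for the strict-mode rule, not the Agents SDK's own code.

```python
import uuid

# Hand-written approximation (an assumption, not generated output) of the
# JSON schema shape for a model with a dict-valued "nodes" field.
PAPER_LIKE_SCHEMA = {
    "type": "object",
    "properties": {
        "nodes": {
            "type": "object",
            # A free-form mapping is expressed via additionalProperties.
            "additionalProperties": {"type": "object", "properties": {}},
        }
    },
}


def find_open_objects(schema, path="$"):
    """Return paths of object schemas whose additionalProperties is truthy.

    Hypothetical helper that mirrors the spirit of the strict-mode check.
    """
    hits = []
    if isinstance(schema, dict):
        if schema.get("type") == "object" and schema.get("additionalProperties"):
            hits.append(path)
        for key, value in schema.items():
            hits.extend(find_open_objects(value, f"{path}.{key}"))
    return hits


def is_uuid(text):
    """Return True if text parses as a UUID, False otherwise."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


# Failure mode 1: the dict-valued field is flagged as an open object.
print(find_open_objects(PAPER_LIKE_SCHEMA))
# Failure mode 2: free-form ids of the kind a model emits do not parse.
print(is_uuid("intro_node"), is_uuid(str(uuid.uuid4())))
```

The first print lists the path of the offending field; the second shows that a model-style id is rejected while a generated UUID is accepted.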
Introduce a strict-schema-safe LLM target in the apex layer and keep the
canonical store schema untouched (UUID identity + dict map are
load-bearing for dedup/indexing per the storage design doc):
- quantmind/flows/_paper_draft.py: PaperDraft/PaperDraftNode (nested
children, no id bookkeeping) + draft_to_paper(), which assigns real
UUIDs, wires parent_id/children_ids, and injects provenance (source,
arxiv_id, authors, published-date as_of) the flow already knows from
the fetch layer instead of trusting the model to author it.
- paper_flow now targets PaperDraft by default and lifts the result into
a Paper. A caller-supplied output_type still bypasses the draft and is
returned verbatim (isinstance(result, Paper) pass-through), so the
existing override contract is preserved.
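
The lifting step can be sketched as follows. This is an illustration of the idea only, assuming simplified dataclasses: DraftNode, FlatNode, and flatten are hypothetical stand-ins for PaperDraftNode, TreeNode, and draft_to_paper, and provenance injection is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class DraftNode:
    """Hypothetical stand-in for PaperDraftNode: nested children, no ids."""
    title: str
    children: list["DraftNode"] = field(default_factory=list)


@dataclass
class FlatNode:
    """Hypothetical stand-in for the store's TreeNode: ids and links only."""
    id: UUID
    title: str
    parent_id: Optional[UUID]
    position: int
    children_ids: list[UUID] = field(default_factory=list)


def flatten(root: DraftNode) -> tuple[UUID, dict[UUID, FlatNode]]:
    """Assign fresh UUIDs and wire parent/child links in one depth-first pass."""
    nodes: dict[UUID, FlatNode] = {}

    def visit(draft: DraftNode, parent_id: Optional[UUID], position: int) -> UUID:
        node = FlatNode(
            id=uuid4(), title=draft.title, parent_id=parent_id, position=position
        )
        nodes[node.id] = node
        # Children keep the order in which the draft listed them.
        for index, child in enumerate(draft.children):
            node.children_ids.append(visit(child, node.id, index))
        return node.id

    root_id = visit(root, None, 0)
    return root_id, nodes


draft = DraftNode(
    "Paper",
    [DraftNode("Intro"), DraftNode("Method", [DraftNode("Data")])],
)
root_id, nodes = flatten(draft)
print(len(nodes), [nodes[c].title for c in nodes[root_id].children_ids])
```

Because ids are minted by the converter rather than by the model, every parent_id and children_ids entry refers to a node that exists in the map by construction.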
As a side effect this fixes empty Paper.arxiv_id/authors: previously the
model was asked to author provenance and routinely left it blank.
Verified end-to-end against the live Agents SDK (gpt-4o-mini): a real
arXiv paper now extracts into a 19-node Paper with UUID ids and survives
a JSON store round-trip. Full verify harness green (248 tests, 90% cov).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Description
paper_flow currently cannot complete an end-to-end extraction against the OpenAI Agents SDK, because it sets the agent's output_type to the canonical knowledge.Paper, a store schema that is hostile to structured output:
- Paper.nodes is a dict[UUID, TreeNode]. A dict serialises to additionalProperties, which the SDK's strict JSON-schema mode rejects: UserError: additionalProperties should not be set for object types.
- With strict mode off, models do not emit RFC-4122 UUIDs, so the UUID id fields fail Pydantic validation on free-form ids like "intro_node": Input should be a valid UUID ... 'root'.
Either way the default flow raises before returning, so no real paper extracts. This is the migration gap flagged in the README note (#71).
Approach
Keep the canonical store schema untouched, since its UUID identity + flat dict map are load-bearing for dedup/indexing per docs/design/en/storage.md, and fix the issue at the apex (flows) layer, which is where the LLM boundary actually lives (and the only layer import-linter allows to import knowledge):
- quantmind/flows/_paper_draft.py (new): PaperDraft/PaperDraftNode, a strict-schema-safe extraction target. Children are embedded (a real nested tree) rather than referenced by id, so the model never maintains a flat id map, eliminating dangling-reference / duplicate-id / missing-root failure modes. draft_to_paper() lifts a validated draft into a canonical Paper: assigns real UUIDs, wires parent_id/children_ids/position, and injects provenance (source, arxiv_id, authors, publication-date as_of) the flow already knows from the fetch layer instead of asking the model to author it.
- quantmind/flows/paper.py: targets PaperDraft by default and lifts the result into a Paper. A caller-supplied output_type still bypasses the draft and is returned verbatim (isinstance(result, Paper) pass-through), preserving the existing override contract. The arxiv branch of _fetch_and_format now propagates published_at so as_of is the real cutoff.
As a side effect this also fixes empty Paper.arxiv_id/Paper.authors: previously the model was asked to author provenance and routinely left it blank.
Verification
- ruff format / ruff check / basedpyright / lint-imports / pytest: 248 passed, 90% coverage (15 new unit tests for the draft schema, converter, provenance injection, and the flow integration path).
- End-to-end against the live Agents SDK (gpt-4o-mini, default path, no strict-schema overrides): a real arXiv paper (2605.20636) extracts into a 19-node Paper with UUID ids and a correct section/subsection tree, then survives a JSON store round-trip.
Checklist
$CATEGORY(xx): xxx (fix(flows): ...)
🤖 Generated with Claude Code